37 research outputs found

    Improve the Performance and Scalability of RAID-6 Systems Using Erasure Codes

    RAID-6 is widely used to tolerate concurrent failures of any two disks, providing a higher level of reliability with the support of erasure codes. Among many implementations, one class of codes called Maximum Distance Separable (MDS) codes aims to offer data protection against disk failures with optimal storage efficiency. Typical MDS codes include horizontal and vertical codes. However, because of the limitations of the horizontal parity or diagonal/anti-diagonal parities used in MDS codes, existing RAID-6 systems suffer from several performance and scalability problems, such as low write performance, unbalanced I/O, and high migration cost during scaling. To address these problems, this dissertation designs techniques for high-performance and scalable RAID-6 systems, including high-performance, load-balancing erasure codes (H-Code and HDP Code) and a Stripe-based Data Migration (SDM) scheme. We also propose a flexible MDS Scaling Framework (MDS-Frame), which integrates H-Code, HDP Code, and the SDM scheme. Detailed evaluation results are also given in this dissertation.
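
    The dissertation's own codes (H-Code and HDP Code) are not reproduced here, but the horizontal and diagonal parities it refers to are plain XOR sums over a stripe. Below is a minimal Python sketch of generic row parity and wrapping diagonal parity for a toy stripe; the layout and symbol values are invented for illustration and do not follow any specific MDS code.

        # Generic XOR parity sketch (not H-Code or HDP Code): horizontal
        # parity per row and wrapping diagonal parity per diagonal.
        from functools import reduce
        from operator import xor

        def horizontal_parity(stripe):
            # one parity symbol per row: XOR of that row's data symbols
            return [reduce(xor, row) for row in stripe]

        def diagonal_parity(stripe):
            # one parity symbol per wrapping diagonal of the stripe
            rows, cols = len(stripe), len(stripe[0])
            parity = [0] * cols
            for r in range(rows):
                for c in range(cols):
                    parity[(r + c) % cols] ^= stripe[r][c]
            return parity

        stripe = [[0x11, 0x22, 0x33, 0x44],
                  [0x55, 0x66, 0x77, 0x88],
                  [0x99, 0xAA, 0xBB, 0xCC],
                  [0xDD, 0xEE, 0xFF, 0x01]]
        print(horizontal_parity(stripe))   # row parities
        print(diagonal_parity(stripe))     # diagonal parities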

    When Distributed Consensus Meets Wireless Connected Autonomous Systems: A Review and A DAG-based Approach

    The era of connected and autonomous systems (CAS) and autonomous driving is approaching. To support CAS applications such as AI-driven decision-making and blockchain-based smart data management platforms, data and message exchange/dissemination is a fundamental element. Distributed message broadcast and forwarding protocols in CAS, such as vehicular ad hoc networks (VANETs), can suffer from significant message loss and uncertain transmission delay, and faulty nodes might disseminate fake messages to confuse the network. Therefore, a consensus mechanism is essential in distributed CAS to guarantee that correct nodes agree on the same parameters and reach consistency. However, due to the wireless nature of CAS, traditional consensus cannot be directly deployed. This article reviews several existing consensus mechanisms, including average/maximum/minimum estimation consensus mechanisms that operate on quantities, Byzantine fault tolerance consensus for requests, state machine replication (SMR), and blockchain, as well as their implementations in CAS. To deploy wireless-adapted consensus, we propose a Directed Acyclic Graph (DAG)-based message structure to build a non-equivocation data dissemination protocol for CAS that is resilient against message loss and unpredictable forwarding latency. Finally, we enhance this protocol by developing a two-dimensional DAG-based strategy to achieve partial order for blockchain and total order for the distributed service model SMR.
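
    As a rough illustration of the DAG-based non-equivocation idea (a simplified sketch under assumed semantics, not the paper's actual protocol), each message can carry the hashes of previously delivered messages as parents, so a sender that issues two different messages for the same sequence number is detected. The class names and fields below are hypothetical.

        # Simplified DAG message log: parents are message hashes; a message is
        # delivered only after its parents, and conflicting (sender, seq) pairs
        # are flagged as equivocation.
        import hashlib, json

        def digest(payload, sender, seq, parents):
            blob = json.dumps({"payload": payload, "sender": sender,
                               "seq": seq, "parents": sorted(parents)},
                              sort_keys=True).encode()
            return hashlib.sha256(blob).hexdigest()

        class DagLog:
            def __init__(self):
                self.messages = {}   # hash -> message
                self.seen = {}       # (sender, seq) -> hash

            def deliver(self, payload, sender, seq, parents):
                if any(p not in self.messages for p in parents):
                    return None      # buffer until all parents are delivered
                h = digest(payload, sender, seq, parents)
                key = (sender, seq)
                if key in self.seen and self.seen[key] != h:
                    raise ValueError(f"equivocation by {sender} at seq {seq}")
                self.messages[h] = {"payload": payload, "sender": sender,
                                    "seq": seq, "parents": parents}
                self.seen[key] = h
                return h

        log = DagLog()
        a = log.deliver("brake ahead", "vehicle-1", 0, [])
        b = log.deliver("ack", "vehicle-2", 0, [a])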

    Adaptive incentive for cross-silo federated learning: A multi-agent reinforcement learning approach

    Cross-silo federated learning (FL) is a typical form of FL that enables organizations (e.g., financial or medical entities) to train global models on isolated data. Reasonable incentives are key to encouraging organizations to contribute data. However, existing work on incentivizing cross-silo FL does not account for environmental dynamics (e.g., the precision of the trained global model and the data owned by uncertain clients during training). Moreover, most of it assumes that organizations share private information, which is unrealistic. To overcome these limitations, we propose a novel adaptive mechanism for cross-silo FL that incentivizes organizations to contribute data so as to maximize their long-term payoffs in a realistic, dynamic training environment. The mechanism is based on multi-agent reinforcement learning, which learns near-optimal data contribution strategies from the history of potential games without organizations' private information. Experiments demonstrate that our mechanism achieves adaptive incentives and effectively improves organizations' long-term payoffs.
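
    A toy way to picture the multi-agent reinforcement learning part (a sketch under a stylized payoff model, not the paper's mechanism): each organization is an independent learner that picks a data-contribution level, observes a payoff coupling a shared accuracy gain with its own data cost, and updates its value estimates. The payoff function, action set, and constants below are all assumptions.

        # Independent learners choosing data-contribution levels (toy model).
        import math, random

        ACTIONS = [0, 1, 2, 3]        # hypothetical contribution levels
        COST, ALPHA, EPS = 0.4, 0.1, 0.2

        def payoff(own, total):
            # shared, concave accuracy gain minus this agent's data cost
            return math.log(1 + total) - COST * own

        q = [{a: 0.0 for a in ACTIONS} for _ in range(3)]   # 3 organizations

        for _ in range(5000):
            acts = [random.choice(ACTIONS) if random.random() < EPS
                    else max(q[i], key=q[i].get) for i in range(3)]
            total = sum(acts)
            for i, a in enumerate(acts):
                q[i][a] += ALPHA * (payoff(a, total) - q[i][a])  # bandit-style update

        print([max(qi, key=qi.get) for qi in q])   # learned contribution levels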

    MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems

    Along with the fast evolution of deep neural networks, hardware systems are also developing rapidly. As a promising solution that achieves high scalability and low manufacturing cost, multi-accelerator systems are widely used in data centers, cloud platforms, and SoCs. A challenging problem thus arises in multi-accelerator systems: selecting a proper combination of accelerators from the available designs and searching for efficient DNN mapping strategies. To this end, we propose MARS, a novel mapping framework that performs computation-aware accelerator selection and applies communication-aware sharding strategies to maximize parallelism. Experimental results show that MARS achieves a 32.2% latency reduction on average for typical DNN workloads compared to the baseline, and a 59.4% latency reduction on heterogeneous models compared to the corresponding state-of-the-art method. (Comment: accepted by the 60th DAC.)
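
    The gist of combining computation-aware accelerator selection with communication-aware sharding can be pictured as a small cost-model search. The sketch below is only an illustrative toy, not MARS's actual cost model, search space, or results; all accelerator names and numbers are invented.

        # Toy search: pick an accelerator subset and per-layer shard counts that
        # minimize estimated compute + communication latency.
        from itertools import combinations

        ACCELERATORS = {"accA": 8.0, "accB": 5.0, "accC": 3.0}  # TFLOP/s (made up)
        LAYER_FLOPS = [4.0, 6.0, 2.0]                           # per-layer TFLOPs
        LINK_COST = 0.5                                         # sync cost per extra device

        def estimate_latency(devices, shards):
            lat = 0.0
            for flops, n in zip(LAYER_FLOPS, shards):
                slowest = min(ACCELERATORS[d] for d in devices[:n])
                lat += flops / n / slowest + LINK_COST * (n - 1)  # compute + comm
            return lat

        best = None
        for k in range(1, len(ACCELERATORS) + 1):
            for combo in combinations(ACCELERATORS, k):
                shards = [min(k, 2)] * len(LAYER_FLOPS)  # toy sharding choice
                cand = (estimate_latency(list(combo), shards), combo, shards)
                if best is None or cand[0] < best[0]:
                    best = cand

        print(best)   # (estimated latency, chosen accelerators, shard counts)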